class: center, middle, inverse, title-slide .title[ # ISA 419: Data-Driven Security ] .subtitle[ ## 16: Machine Learning Regression Models ] .author[ ###
Fadel M. Megahed, PhD
Professor
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
]
.date[
### Spring 2025
]

---

## Quick Refresher of Last Class

✅ Describe the basic concepts of regression analysis, including the roles of independent and dependent variables.

✅ Assess regression models using metrics like R-squared and Mean Squared Error and interpret the results.

✅ Describe the two modeling mindsets: explanatory and predictive.

---

## Learning Objectives for Today's Class

- Recognize the differences and similarities between traditional regression and machine learning regression models
- Describe advanced techniques like Ridge and Lasso
- Explore non-linear models like Decision Trees and Random Forests for regression tasks
- Apply machine learning regression models to cybersecurity datasets

---
class: inverse, center, middle

# Differences Between Traditional (Statistical) and Algorithmic (Machine Learning) Modeling Cultures

---

## Statistical Modeling Culture

Find a stochastic model of the data-generating process in the form of: `\(Y = f(X, \, \text{parameters}, \, \epsilon)\)`

<img src="data:image/png;base64,#../../figures/stat_model_culture.png" width="25%" style="display: block; margin: auto;" />

.pull-left[
.center[**Assumptions**]
.font90[
- Specific stochastic model of the data-generating process
- Distributional assumptions about the errors
- Linearity
- Manual specification of interactions
]
]

.pull-right[
.center[**Problems**]
.font90[
- Conclusions are about the model, not about "nature"
- Assumptions are often violated
- Often no model evaluation (no focus on prediction)
]
]

.footnote[
<html> <hr> </html>
**Image Source**: Christoph Molnar. (2014).
[Statistical Modeling: The Two Cultures - Based on Leo Breiman's paper](https://www.slideshare.net/christophmolnar/presentation-on-statistical-modeling-the-two-cultures)
]

---

## Algorithmic (Machine Learning) Modeling Culture

Find a function `\(f(X)\)` that minimizes the loss `\(L(Y, \, f(X))\)`

<img src="data:image/png;base64,#../../figures/alg_model_culture.png" width="25%" style="display: block; margin: auto;" />

.pull-left[
.center[**Assumptions**]
.font90[
- No assumptions about the data-generating process (the true mechanism generating the data is **unknown** and not of direct interest).
- No assumptions about the distribution of the errors.
]
]

.pull-right[
.center[**Problems**]
.font90[
- Some models are "black boxes".
- Stability of the model is not guaranteed (and, honestly, this is also a problem in statistical modeling).
]
]

.footnote[
<html> <hr> </html>
**Image Source**: Christoph Molnar. (2014). [Statistical Modeling: The Two Cultures - Based on Leo Breiman's paper](https://www.slideshare.net/christophmolnar/presentation-on-statistical-modeling-the-two-cultures)
]

---

## Statistical Modeling: The Two Cultures

<img src="data:image/png;base64,#../../figures/two_cultures.png" width="40%" style="display: block; margin: auto;" />

.footnote[
<html> <hr> </html>
**Note**: You are highly encouraged to read Leo Breiman's paper on the two cultures of statistical modeling. It is a classic and a must-read for anyone interested in data science and machine learning. See [Breiman 2001](https://www.jstor.org/stable/2676681).
]

---
class: inverse, center, middle

# Advanced Techniques for Regression

---

## Ridge Regression

<img src="data:image/png;base64,#../../figures/Ridge_Regression_print.png" width="72%" style="display: block; margin: auto;" />

.footnote[
<html> <hr> </html>
**Image Source:** Chris Albon. (2017). [Machine Learning Flashcards](https://machinelearningflashcards.com/). Flashcards were purchased by the author and are not available for free.
]

---

## Lasso Regression

<img src="data:image/png;base64,#../../figures/Lasso_For_Feature_Selection_print.png" width="72%" style="display: block; margin: auto;" />

.footnote[
<html> <hr> </html>
**Image Source:** Chris Albon. (2017). [Machine Learning Flashcards](https://machinelearningflashcards.com/). Flashcards were purchased by the author and are not available for free.
]

---
class: inverse, center, middle

# Non-Linear Models for Regression

---

## Decision Trees (Classification and Regression Trees)

<img src="data:image/png;base64,#../../figures/Decision_Trees_print.png" width="72%" style="display: block; margin: auto;" />

.footnote[
<html> <hr> </html>
**Image Source:** Chris Albon. (2017). [Machine Learning Flashcards](https://machinelearningflashcards.com/). Flashcards were purchased by the author and are not available for free.
]

---

## Decision Trees (Classification and Regression Trees)

<img src="data:image/png;base64,#../../figures/Decision_Tree_Regression_print.png" width="72%" style="display: block; margin: auto;" />

.footnote[
<html> <hr> </html>
**Image Source:** Chris Albon. (2017). [Machine Learning Flashcards](https://machinelearningflashcards.com/). Flashcards were purchased by the author and are not available for free.
]

---

## Random Forests

<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*ZFuMI_HrI3jt2Wlay73IUQ.png" width="72%" style="display: block; margin: auto;" />

.footnote[
<html> <hr> </html>
**Image Source:** Chaya. (2020). [Random Forest Regression](https://levelup.gitconnected.com/random-forest-regression-209c0f354c84).
]

---

## Random Forests

<img src="data:image/png;base64,#../../figures/The_Random_In_Random_Forest_print.png" width="72%" style="display: block; margin: auto;" />

.footnote[
<html> <hr> </html>
**Image Source:** Chris Albon. (2017). [Machine Learning Flashcards](https://machinelearningflashcards.com/). Flashcards were purchased by the author and are not available for free.
]

---
class: inverse, center, middle

# Applications to Cybersecurity Datasets

---

## Example: Predicting Flow Duration in IoT Networks

``` python
import pandas as pd
from ucimlrepo import fetch_ucirepo

# fetch dataset
rt_iot2022 = fetch_ucirepo(id=942)

# data (as pandas dataframes)
X = rt_iot2022.data.features
y = rt_iot2022.data.targets

# combine the features and targets into a single data frame
df = pd.concat([X, y], axis=1)

# take a reproducible 10% sample of rows and keep the columns of interest
df = (
    df
    .sample(frac=0.1, random_state=2025)
    .reset_index(drop=True)
    .loc[:, ['proto', 'service', 'fwd_pkts_payload.avg', 'bwd_pkts_payload.avg',
             'fwd_subflow_pkts', 'bwd_subflow_pkts', 'Attack_type', 'flow_duration']]
)
```

.footnote[
<html> <hr> </html>
**Note**: The dataset is from the UCI Machine Learning Repository. It contains features of network traffic in IoT networks; in this example, the target variable is the flow duration, which we aim to predict from the other features. For more information, see [UCI RT-IoT2022](https://archive.ics.uci.edu/dataset/942/rt-iot2022).
]

---
class: inverse, center, middle

# Recap

---

## Summary of Main Points

By now, you should be able to do the following:

- Recognize the differences and similarities between traditional regression and machine learning regression models
- Describe advanced techniques like Ridge and Lasso
- Explore non-linear models like Decision Trees and Random Forests for regression tasks
- Apply machine learning regression models to cybersecurity datasets

---

## 📝 Review and Clarification 📝

1. **Class Notes**: Take some time to revisit your class notes for key insights and concepts.
2. **Zoom Recording**: The recording of today's class will be made available on Canvas approximately 3-4 hours after the end of class.
3. **Questions**: Please don't hesitate to ask for clarification on any topics discussed in class. It's crucial not to let questions accumulate.
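
---

## Sketch: Ridge vs. Lasso Shrinkage

The Ridge and Lasso techniques listed in the summary can be illustrated with a minimal sketch; it is not the course's official code. It assumes a single centered predictor with no intercept, so both penalized estimates have closed forms. The function names (`ols_coef`, `ridge_coef`, `lasso_coef`), the penalty name `lam`, and the toy data are hypothetical and chosen only for illustration.

```python
def _sums(x, y):
    """Return (sum of x*y, sum of x*x) for two equal-length sequences."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy, sxx


def ols_coef(x, y):
    """Slope minimizing sum((y - b*x)^2); no penalty."""
    sxy, sxx = _sums(x, y)
    return sxy / sxx


def ridge_coef(x, y, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * b^2."""
    sxy, sxx = _sums(x, y)
    return sxy / (sxx + lam)


def lasso_coef(x, y, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * |b| (soft-thresholding)."""
    sxy, sxx = _sums(x, y)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # pull toward zero, floor at zero
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx


# Toy, already-centered data (assumed for illustration only)
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.9, -1.1, 0.2, 0.8, 2.0]

print(f"OLS slope: {ols_coef(x, y):.3f}")
for lam in (10.0, 20.0):
    print(f"lam={lam}: ridge={ridge_coef(x, y, lam):.3f}, "
          f"lasso={lasso_coef(x, y, lam):.3f}")
```

Ridge shrinks the coefficient toward zero without reaching it, while a large enough Lasso penalty sets it to exactly zero, which is why Lasso is used for feature selection. For multiple features, one would likely use scikit-learn's `Ridge` and `Lasso` classes instead; note that scikit-learn's Lasso objective scales the squared-error term by the sample size, so its `alpha` is not directly comparable to `lam` here.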